Abstract. We give a new proof of VC bounds where we avoid the use of
symmetrization and use a shadow sample of arbitrary size. We also improve
on the variance term. This results in better constants, as shown on numerical
examples. Moreover, our bounds still hold for non identically distributed
independent random variables.
1. Description of the problem

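Let $(\mathcal{X}, \mathcal{B})$ be some measurable space and $\mathcal{Y}$ some finite set. Let $(\Theta, \mathcal{T})$ be a
measurable parameter space and $\{f_\theta : \mathcal{X} \to \mathcal{Y}, \theta \in \Theta\}$ be a family of decision
functions. Assume that

$$(\theta, x) \mapsto f_\theta(x) : (\Theta \times \mathcal{X}, \mathcal{T} \otimes \mathcal{B}) \to \mathcal{Y}$$

is measurable. Let

$$P_i \in \mathcal{M}_+^1\bigl(\mathcal{X} \times \mathcal{Y}, \mathcal{B} \otimes \{0,1\}^{\mathcal{Y}}\bigr), \quad i = 1, \dots, N,$$

be some probability distributions on $\mathcal{X} \times \mathcal{Y}$, where $\{0,1\}^{\mathcal{Y}}$ is the discrete sigma
algebra of all the subsets of $\mathcal{Y}$. Let $(X_i, Y_i)_{i=1}^N$ be the canonical process on $(\mathcal{X} \times \mathcal{Y})^N$,
i.e. the coordinate process $(X_i, Y_i)(\omega) = \omega_i$, $\omega \in (\mathcal{X} \times \mathcal{Y})^N$. Let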



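$$R(\theta) = \frac{1}{N} \sum_{i=1}^N P_i\bigl[Y_i \neq f_\theta(X_i)\bigr], \qquad r(\theta) = \frac{1}{N} \sum_{i=1}^N \mathbb{1}\bigl[Y_i \neq f_\theta(X_i)\bigr].$$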



We are interested in bounding, with $\bigotimes_{i=1}^N P_i$ probability at least $1 - \epsilon$ and for any
$\theta \in \Theta$, the quantity $R(\theta) - r(\theta)$. This question is of interest both in statistical
learning theory and in empirical process theory.
In the case when $|\mathcal{Y}| = 2$, introducing the notation

$$\mathcal{N}(X_1^N) = \bigl|\{[f_\theta(X_i)]_{i=1}^N : \theta \in \Theta\}\bigr|,$$

where $|A|$ is the number of elements of the set $A$, Vapnik proved in [13, page 138]
that

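Theorem 1.1. For any probability distribution $P \in \mathcal{M}_+^1(\mathcal{X} \times \mathcal{Y})$, with $P^{\otimes N}$ probability
at least $1 - \epsilon$, for any $\theta \in \Theta$,

$$R(\theta) \leq r(\theta) + \frac{2d'}{N}\left(1 + \sqrt{1 + \frac{N r(\theta)}{d'}}\right),$$

where

$$d' = \log\bigl\{P^{\otimes 2N}\bigl[\mathcal{N}(X_1^{2N})\bigr]\bigr\} + \log(4\epsilon^{-1}).$$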

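It is also well known since the works of Vapnik and Cervonenkis that, in the case
when $\mathcal{Y} = \{0,1\}$,

$$\log\bigl[\mathcal{N}(X_1^N)\bigr] \leq h \log\Bigl(\frac{eN}{h}\Bigr),$$

where

$$h = \max\bigl\{|A| \,;\, \mathcal{N}\bigl[(X_i)_{i \in A}\bigr] = 2^{|A|}\bigr\}.$$

Therefore when the VC dimension of $\{f_\theta ; \theta \in \Theta\}$ is not greater than $h$, that is when
by definition

$$\max\bigl\{|A| \,;\, A \subset \mathcal{X},\ \bigl|\{(f_\theta(x))_{x \in A} ; \theta \in \Theta\}\bigr| = 2^{|A|}\bigr\} \leq h,$$

we have the following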

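Corollary 1.2. When the VC dimension of $\{f_\theta ; \theta \in \Theta\}$ is not greater than $h$, with
$P^{\otimes N}$ probability at least $1 - \epsilon$, for any $\theta \in \Theta$,

$$R(\theta) \leq r(\theta) + \frac{2d'}{N}\left(1 + \sqrt{1 + \frac{N r(\theta)}{d'}}\right),$$

where

$$d' = h \log\Bigl(\frac{2eN}{h}\Bigr) + \log(4\epsilon^{-1}).$$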
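To get a feel for the size of this classical bound, here is a minimal numerical sketch. It assumes the usual form of the inequality, $R \leq r + \frac{2d'}{N}\bigl(1 + \sqrt{1 + N r / d'}\bigr)$ with $d' = h\log(2eN/h) + \log(4/\epsilon)$; this form is an assumption of the sketch, chosen because it reproduces the value 0.610 quoted for the corollary in section 4. The function name `vapnik_bound` is illustrative.

```python
import math


def vapnik_bound(r, N, h, eps):
    """Classical VC bound, in the form assumed above.

    r   : empirical error rate
    N   : sample size
    h   : upper bound on the VC dimension
    eps : confidence parameter (the bound holds with probability >= 1 - eps)
    """
    d_prime = h * math.log(2 * math.e * N / h) + math.log(4 / eps)
    return r + (2 * d_prime / N) * (1 + math.sqrt(1 + N * r / d_prime))


if __name__ == "__main__":
    # Running example of the paper: N = 1000, h = 10, eps = 0.01, r = 0.2.
    print(vapnik_bound(0.2, 1000, 10, 0.01))
```

On the running example used throughout the paper the value lies slightly above 0.610, that is, above the error rate 0.5 of a random binary classifier.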

The aim of this paper is to improve theorem 1.1 and its corollary, using
PAC-Bayesian inequalities with data dependent priors.

We have already proved in [5] that with $P^{\otimes N}$ probability at least $1 - \epsilon$,

where 

which brings an improvement when r(9) < and d is large. 

Here we are going to generalize this theorem to arbitrary shadow sample sizes
and non identically distributed independent random variables. We will also improve
on the variance term in (1.1) and get rid of the (unwanted!) parameter $\xi$.

Moreover, we will derive VC bounds in the transductive setting in which the 
shadow sample error rate is bounded in terms of the empirical error rate (in this 
setting the shadow sample would more appropriately be described as a test set). 

We will start with the transductive setting, since it is of interest in its own right
and will at the same time serve as a technical step towards more classical results.
2. The transductive setting

We will consider a shadow sample of size $kN$ where $k$ is some integer.
Let $(X_i, Y_i)_{i=1}^{(k+1)N}$ be the canonical process on $(\mathcal{X} \times \mathcal{Y})^{(k+1)N}$.
We assume that we observe the first sample $(X_i, Y_i)_{i=1}^N$, that we may also observe
the rest of the design $X_{N+1}^{(k+1)N}$ (this is a short notation for $(X_i)_{i=N+1}^{(k+1)N}$), but that
we do not observe $Y_{N+1}^{(k+1)N}$.

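Let $r_1(\theta)$ and $r_2(\theta)$ be the empirical error rates of the decision function $f_\theta$ on the
training and test sets:

$$r_1(\theta) = \frac{1}{N} \sum_{i=1}^{N} \mathbb{1}\bigl[Y_i \neq f_\theta(X_i)\bigr],$$

$$r_2(\theta) = \frac{1}{kN} \sum_{i=N+1}^{(k+1)N} \mathbb{1}\bigl[Y_i \neq f_\theta(X_i)\bigr].$$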



Let $P \in \mathcal{M}_+^1\bigl[(\mathcal{X} \times \mathcal{Y})^{(k+1)N}\bigr]$ be some partially exchangeable probability distribution
on $(\mathcal{X} \times \mathcal{Y})^{(k+1)N}$. What we mean by partially exchangeable will be precisely defined
in the following. An important case is when $P = \bigl(\bigotimes_{i=1}^N P_i\bigr)^{\otimes (k+1)}$, meaning that
we have $(k+1)$ independent samples, each being distributed according to the same
product of non identical probability distributions. Let as in the introduction



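$$\mathcal{N}\bigl(X_1^{(k+1)N}\bigr) = \bigl|\{[f_\theta(X_i)]_{i=1}^{(k+1)N} : \theta \in \Theta\}\bigr|$$

be the number of distinct decision rules induced by the model on the design
$X_1^{(k+1)N}$. We will prove

Theorem 2.1. With $P$ probability at least $1 - \epsilon$, for any $\theta \in \Theta$,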




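$$r_2(\theta) \leq r_1(\theta) + \frac{d}{N} + \sqrt{\frac{2d(1 + k^{-1}) r_1(\theta)}{N} + \frac{d^2}{N^2}},$$

where $d = \log\bigl[\mathcal{N}(X_1^{(k+1)N})\bigr] + \log(\epsilon^{-1})$.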

Let us recall that when $|\mathcal{Y}| = 2$ and the VC dimension of $\{f_\theta ; \theta \in \Theta\}$ is not
greater than $h$,

$$d \leq h \log\Bigl(\frac{e(k+1)N}{h}\Bigr) + \log(\epsilon^{-1}).$$

Let us take some numerical example: when $N = 1000$, $h = 10$, $\epsilon = 0.01$ and
$r_1(\theta) = 0.2$, we get $r_2(\theta) \leq 0.4872$ using $k = 4$ (whereas for $k = 1$ we get only
$r_2(\theta) \leq 0.5098$, showing that increasing the shadow sample size is useful to get a
bound less than 0.5).
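The numerical example can be replayed with a few lines of code. This is only a sketch: it assumes the bound of theorem 2.1 in the form obtained by solving inequality (2.1) in $r_2(\theta)$, namely $r_2 \leq r_1 + d/N + \sqrt{2d(1+k^{-1})r_1/N + d^2/N^2}$, with $d$ replaced by its VC upper bound $h\log(e(k+1)N/h) + \log(\epsilon^{-1})$. The function names are illustrative.

```python
import math


def vc_complexity(N, k, h, eps):
    """Upper bound on d for a binary class of VC dimension at most h."""
    return h * math.log(math.e * (k + 1) * N / h) + math.log(1 / eps)


def transductive_bound(r1, N, k, h, eps):
    """Bound on the shadow sample error rate r2 (theorem 2.1, assumed form)."""
    d = vc_complexity(N, k, h, eps)
    return r1 + d / N + math.sqrt(2 * d * (1 + 1 / k) * r1 / N + (d / N) ** 2)


if __name__ == "__main__":
    for k in (1, 4):
        print(k, transductive_bound(0.2, 1000, k, 10, 0.01))
```

With $k = 4$ the value should come out just below 0.4872, and with $k = 1$ just below 0.5098, in agreement with the figures quoted above.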

Let us start the proof of theorem 2.1 with some notations and a few lemmas. Let
$\chi_i = \mathbb{1}[Y_i \neq f_\theta(X_i)] \in \{0, 1\}$. For any random variable $h : \Omega = (\mathcal{X} \times \mathcal{Y})^{(k+1)N} \to \mathbb{R}$
(we work on the canonical space), let the transformed random variable $T_i(h)$ be
defined as

$$T_i(h) = \frac{1}{k+1} \sum_{j=0}^{k} h \circ \tau_i^j,$$

where $\tau_i^j : \Omega \to \Omega$ is defined by

$$\bigl[\tau_i^j(\omega)\bigr]_\ell = \begin{cases} \omega_{i+mN}, & \ell = i + [(m + j) \bmod (k+1)]N, \quad m = 0, \dots, k; \\ \omega_\ell, & \ell \notin \{i + mN : m = 0, \dots, k\}. \end{cases}$$



In other words, $\tau_i^j$ performs a circular permutation of the subset of indices
$\{i + mN : m = 0, \dots, k\}$. Notice also that $T_i$ may be viewed as a regular
conditional probability measure.

Definition 2.1. The joint distribution $P$ is said to be partially exchangeable when
for any $i = 1, \dots, N$, any $j = 0, \dots, k$, $P \circ (\tau_i^j)^{-1} = P$.

Equivalently, this means that for any bounded random variable $h$,

$$P(h) = P(h \circ \tau_i^1), \quad i = 1, \dots, N,$$

(since $\tau_i^j$ is the $j$th iterate of $\tau_i^1$). As a result, any partially exchangeable
distribution $P$ is such that for any bounded random variable $h$,

$$P(h) = P\Bigl\{\Bigl[\bigcirc_{i=1}^N T_i\Bigr](h)\Bigr\},$$

where we have used the notation $\bigcirc_{i=1}^N T_i = T_1 \circ T_2 \circ \dots \circ T_N$.
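In the same way

Definition 2.2. A random variable $h : (\mathcal{X} \times \mathcal{Y})^{(k+1)N} \to \mathbb{R}$ is said to be partially
exchangeable when for any $i = 1, \dots, N$, $h \circ \tau_i^1 = h$.

Lemma 2.2. For any $\theta \in \Theta$, any $\omega \in (\mathcal{X} \times \mathcal{Y})^{(k+1)N}$, any positive partially
exchangeable random variable $\lambda$, any partially exchangeable random variable $\eta$,

$$\Bigl(\bigcirc_{i=1}^N T_i\Bigr)\bigl\{\exp\bigl[\lambda[r_2(\theta) - r_1(\theta)] - \eta\bigr]\bigr\}(\omega) \leq \exp\Bigl\{\frac{\lambda^2}{2N}\Bigl[\frac{1}{k} r_1(\theta) + r_2(\theta)\Bigr] - \eta\Bigr\}(\omega).$$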
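Proof.

$$\Bigl(\bigcirc_{i=1}^N T_i\Bigr)\bigl\{\exp\bigl[\lambda[r_2(\theta) - r_1(\theta)] - \eta\bigr]\bigr\}
= \exp(-\eta) \prod_{i=1}^N T_i\Bigl\{\exp\Bigl(\frac{\lambda}{kN} \sum_{j=1}^k \chi_{i+jN} - \frac{\lambda}{N}\chi_i\Bigr)\Bigr\}$$

$$= \exp(-\eta) \prod_{i=1}^N \exp\Bigl(\frac{\lambda}{kN} \sum_{j=0}^k \chi_{i+jN}\Bigr) \prod_{i=1}^N T_i\Bigl\{\exp\Bigl(-\frac{(k+1)\lambda}{kN}\chi_i\Bigr)\Bigr\}.$$

Let $p_i = \frac{1}{k+1} \sum_{j=0}^k \chi_{i+jN}$. Let $\chi$ be the identity (seen as the canonical process) on
$\{0, 1\}$ and $B_p$ be the Bernoulli distribution on $\{0, 1\}$ with parameter $p$, namely let
$B_p(1) = 1 - B_p(0) = p$. It is easily seen that

$$\log\Bigl\{T_i\Bigl[\exp\Bigl(-\frac{(k+1)\lambda}{kN}\chi_i\Bigr)\Bigr]\Bigr\} = \log\Bigl\{B_{p_i}\Bigl[\exp\Bigl(-\frac{(k+1)\lambda}{kN}\chi\Bigr)\Bigr]\Bigr\}.$$

Moreover this last quantity can be bounded in the following way.

$$\log\bigl\{B_p[\exp(-\alpha\chi)]\bigr\} = -\alpha B_p(\chi) + \int_0^\alpha (\alpha - \beta)\, \mathrm{Var}_{B_{f(\beta)}}(\chi)\, d\beta.$$

This is the Taylor expansion of order two of $\alpha \mapsto \log\{B_p[\exp(-\alpha\chi)]\}$, where

$$f(\beta) = \frac{B_p[\chi \exp(-\beta\chi)]}{B_p[\exp(-\beta\chi)]} = \frac{p \exp(-\beta)}{(1 - p) + p \exp(-\beta)} \leq p.$$

Thus

$$\mathrm{Var}_{B_{f(\beta)}}(\chi) = f(\beta)[1 - f(\beta)] \leq \bigl(p \wedge \tfrac{1}{2}\bigr)\bigl[1 - \bigl(p \wedge \tfrac{1}{2}\bigr)\bigr] \leq \Bigl(1 - \frac{1}{k+1}\Bigr) p = \frac{k}{k+1}\, p,$$

for any $p \in \bigl(\frac{1}{k+1}\mathbb{N}\bigr) \cap [0, 1]$. Hence

$$\log\bigl\{B_p[\exp(-\alpha\chi)]\bigr\} \leq -\alpha p + \frac{k \alpha^2}{2(k+1)}\, p,$$

and

$$T_i\Bigl\{\exp\Bigl[-\frac{(k+1)\lambda}{kN}\chi_i\Bigr]\Bigr\} \leq \exp\Bigl[-\frac{(k+1)\lambda}{kN}\, p_i + \frac{(k+1)\lambda^2}{2kN^2}\, p_i\Bigr].$$

Therefore

$$\Bigl(\bigcirc_{i=1}^N T_i\Bigr)\bigl\{\exp\bigl[\lambda[r_2(\theta) - r_1(\theta)] - \eta\bigr]\bigr\} \leq \exp(-\eta) \exp\Bigl[\frac{(k+1)\lambda^2}{2kN^2} \sum_{i=1}^N p_i\Bigr]$$

$$= \exp\Bigl(\frac{\lambda^2}{2kN^2} \sum_{i=1}^{(k+1)N} \chi_i - \eta\Bigr) = \exp\Bigl\{\frac{\lambda^2}{2N}\Bigl[\frac{1}{k} r_1(\theta) + r_2(\theta)\Bigr] - \eta\Bigr\}.$$

□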
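Lemma 2.2 can be checked by brute force on very small samples. The sketch below is not part of the paper: it takes $\lambda$ constant and $\eta = 0$, enumerates all $(k+1)^N$ combinations of circular shifts to evaluate the composition of the averaging operators on $\exp[\lambda(r_2 - r_1)]$, and compares the result with $\exp\{\frac{\lambda^2}{2N}[\frac{1}{k} r_1 + r_2]\}$. Indices are 0-based and all helper names are illustrative.

```python
import itertools
import math
import random


def rates(chi, N, k):
    """Error rates on the first sample (size N) and on the shadow sample (size kN)."""
    return sum(chi[:N]) / N, sum(chi[N:]) / (k * N)


def shift_block(chi, i, j, N, k):
    """Circularly shift by j places the values at indices i, i+N, ..., i+kN."""
    out = list(chi)
    for m in range(k + 1):
        out[i + ((m + j) % (k + 1)) * N] = chi[i + m * N]
    return out


def averaged_exponential(chi, lam, N, k):
    """Average of exp(lam * (r2 - r1)) over every combination of circular shifts."""
    total = 0.0
    for shifts in itertools.product(range(k + 1), repeat=N):
        omega = chi
        for i, j in enumerate(shifts):
            omega = shift_block(omega, i, j, N, k)
        r1, r2 = rates(omega, N, k)
        total += math.exp(lam * (r2 - r1))
    return total / (k + 1) ** N


def lemma_upper_bound(chi, lam, N, k):
    """Right-hand side of lemma 2.2 with eta = 0."""
    r1, r2 = rates(chi, N, k)
    return math.exp(lam ** 2 / (2 * N) * (r1 / k + r2))


if __name__ == "__main__":
    random.seed(1)
    N, k = 4, 2
    chi = [random.randint(0, 1) for _ in range((k + 1) * N)]
    print(averaged_exponential(chi, 3.0, N, k), lemma_upper_bound(chi, 3.0, N, k))
```

The enumeration is exponential in $N$, so this is only usable as a sanity check for a handful of points.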



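Lemma 2.3. For any $\theta \in \Theta$, for any positive partially exchangeable random variable
$\lambda$, for any partially exchangeable random variable $\eta$,

$$P\bigl\{\exp\bigl[\lambda[r_2(\theta) - r_1(\theta)] - \eta\bigr]\bigr\} \leq P\Bigl\{\exp\Bigl[\frac{\lambda^2}{2N}\Bigl[\frac{1}{k} r_1(\theta) + r_2(\theta)\Bigr] - \eta\Bigr]\Bigr\}.$$

Remark 2.1. Let us notice that we do not need integrability conditions, and that
the previous inequality between expectations of positive random variables holds in
$\mathbb{R}_+ \cup \{+\infty\}$, meaning that both members may be equal to $+\infty$.

Remark 2.2. We can take $\eta = \log(\epsilon^{-1}) + \frac{\lambda^2}{2N}\bigl[\frac{1}{k} r_1(\theta) + r_2(\theta)\bigr]$ to get

$$P\Bigl\{\exp\Bigl[\lambda[r_2(\theta) - r_1(\theta)] - \frac{\lambda^2}{2N}\Bigl[\frac{1}{k} r_1(\theta) + r_2(\theta)\Bigr] + \log(\epsilon)\Bigr]\Bigr\} \leq \epsilon.$$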

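Proof. According to the previous lemma,

$$P\bigl\{\exp\bigl[\lambda[r_2(\theta) - r_1(\theta)] - \eta\bigr]\bigr\} = P\Bigl\{\Bigl(\bigcirc_{i=1}^N T_i\Bigr)\bigl\{\exp\bigl[\lambda[r_2(\theta) - r_1(\theta)] - \eta\bigr]\bigr\}\Bigr\}$$

$$\leq P\Bigl\{\exp\Bigl[\frac{\lambda^2}{2N}\Bigl[\frac{1}{k} r_1(\theta) + r_2(\theta)\Bigr] - \eta\Bigr]\Bigr\}.$$

□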



Let us now consider some partially exchangeable prior distribution $\pi \in \mathcal{M}_+^1(\Theta)$:

Definition 2.3. A regular conditional probability distribution

$$\pi : (\mathcal{X} \times \mathcal{Y})^{(k+1)N} \to \mathcal{M}_+^1(\Theta, \mathcal{T})$$

is said to be partially exchangeable when for any
$i = 1, \dots, N$, any $\omega \in (\mathcal{X} \times \mathcal{Y})^{(k+1)N}$, $\pi[\tau_i^1(\omega)] = \pi(\omega)$, this being an equality
between probability measures in $\mathcal{M}_+^1(\Theta, \mathcal{T})$.

In the following, $\lambda$ and $\eta$ will be random variables depending on the parameter
$\theta$. We will say that a real random variable $h : (\mathcal{X} \times \mathcal{Y})^{(k+1)N} \times \Theta \to \mathbb{R}$ is partially
exchangeable when $h(\omega, \theta) = h[\tau_i^1(\omega), \theta]$, $i = 1, \dots, N$, $\theta \in \Theta$, $\omega \in (\mathcal{X} \times \mathcal{Y})^{(k+1)N}$.

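Lemma 2.4. For any partially exchangeable prior distribution $\pi$, any positive partially
exchangeable random variable $\lambda : (\mathcal{X} \times \mathcal{Y})^{(k+1)N} \times \Theta \to \mathbb{R}_+$, and any partially
exchangeable random threshold function $\eta : (\mathcal{X} \times \mathcal{Y})^{(k+1)N} \times \Theta \to \mathbb{R}$,

$$P\Bigl\{\pi\bigl\{\exp\bigl[\lambda(\theta)[r_2(\theta) - r_1(\theta)] - \eta(\theta)\bigr]\bigr\}\Bigr\}
\leq P\Bigl\{\pi\Bigl\{\exp\Bigl[\frac{\lambda(\theta)^2}{2N}\Bigl[\frac{1}{k} r_1(\theta) + r_2(\theta)\Bigr] - \eta(\theta)\Bigr]\Bigr\}\Bigr\}.$$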

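Proof. It is a consequence of lemma 2.2 and of the following identities:

$$P\Bigl\{\pi\bigl\{\exp\bigl[\lambda(\theta)[r_2(\theta) - r_1(\theta)] - \eta(\theta)\bigr]\bigr\}\Bigr\}
= P\Bigl\{\Bigl(\bigcirc_{i=1}^N T_i\Bigr)\Bigl(\pi\bigl\{\exp\bigl[\lambda(\theta)[r_2(\theta) - r_1(\theta)] - \eta(\theta)\bigr]\bigr\}\Bigr)\Bigr\}$$

$$= P\Bigl\{\pi\Bigl\{\Bigl(\bigcirc_{i=1}^N T_i\Bigr) \exp\bigl[\lambda(\theta)[r_2(\theta) - r_1(\theta)] - \eta(\theta)\bigr]\Bigr\}\Bigr\}.$$

Indeed for any positive random variable $h : (\mathcal{X} \times \mathcal{Y})^{(k+1)N} \times \Theta \to \mathbb{R}$,
$\pi(h) \circ \tau_i^j = (\pi \circ \tau_i^j)(h \circ \tau_i^j) = \pi(h \circ \tau_i^j)$.

Thus

$$T_i[\pi(h)] = \frac{1}{k+1} \sum_{j=0}^k \pi(h) \circ \tau_i^j = \pi\Bigl(\frac{1}{k+1} \sum_{j=0}^k h \circ \tau_i^j\Bigr) = \pi(T_i h).$$

□

As a consequence, we get the following learning theorem: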



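Theorem 2.5. For any partially exchangeable prior distribution $\pi$, any positive
partially exchangeable random variable $\lambda$, with $P$ probability at least $1 - \epsilon$, for any
$\rho \in \mathcal{M}_+^1(\Theta)$,

$$\rho[\lambda(\theta) r_2(\theta)] - \rho[\lambda(\theta) r_1(\theta)] \leq \rho\Bigl\{\frac{\lambda(\theta)^2}{2N}\Bigl[\frac{1}{k} r_1(\theta) + r_2(\theta)\Bigr]\Bigr\} + \mathcal{K}(\rho, \pi) + \log(\epsilon^{-1}).$$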



Proof. Take $\eta(\theta) = \frac{\lambda(\theta)^2}{2N}\bigl[\frac{1}{k} r_1(\theta) + r_2(\theta)\bigr] + \log(\epsilon^{-1})$ and notice that it is indeed a
partially exchangeable threshold function.

Improved Vapnik Cervonenkis bounds, Olivier Catoni, February 8, 2008
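Thus

$$P\Bigl\{\sup_{\rho \in \mathcal{M}_+^1(\Theta)} \rho[\lambda(\theta) r_2(\theta)] - \rho[\lambda(\theta) r_1(\theta)] - \rho\Bigl\{\frac{\lambda(\theta)^2}{2N}\Bigl[\frac{1}{k} r_1(\theta) + r_2(\theta)\Bigr]\Bigr\} - \mathcal{K}(\rho, \pi) + \log(\epsilon) > 0\Bigr\}$$

$$= P\Bigl\{\log\Bigl\{\pi\Bigl[\exp\Bigl\{\lambda(\theta)[r_2(\theta) - r_1(\theta)] - \frac{\lambda(\theta)^2}{2N}\Bigl[\frac{1}{k} r_1(\theta) + r_2(\theta)\Bigr] + \log(\epsilon)\Bigr\}\Bigr]\Bigr\} > 0\Bigr\}$$

$$\leq P\Bigl\{\pi\Bigl[\exp\Bigl\{\lambda(\theta)[r_2(\theta) - r_1(\theta)] - \frac{\lambda(\theta)^2}{2N}\Bigl[\frac{1}{k} r_1(\theta) + r_2(\theta)\Bigr] + \log(\epsilon)\Bigr\}\Bigr]\Bigr\} \leq \epsilon.$$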



We have used the identity $\log\{\pi[\exp(h)]\} = \sup_{\rho \in \mathcal{M}_+^1(\Theta)} \rho(h) - \mathcal{K}(\rho, \pi)$. See for
instance 01 pages 159-160] or lemma 4.2] for a proof. □ 

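Let us consider the map $\Psi : \Theta \to \mathcal{Y}^{(k+1)N}$ which restricts each classification rule
to the design: $\Psi(\theta) = [f_\theta(X_i)]_{i=1}^{(k+1)N}$. Let $\Theta/\Psi$ be the set of components of $\Theta$ for
the equivalence relation $\{(\theta_1, \theta_2) \in \Theta^2 ; \Psi(\theta_1) = \Psi(\theta_2)\}$. Let $c : \{0, 1\}^\Theta \to \Theta$ be
such that $c(\Theta') \in \Theta'$ for each $\Theta' \subset \Theta$ (the function $c$ chooses some element from any
subset of $\Theta$). Let $\Theta' = c(\Theta/\Psi)$. Let us note that $\Psi$ and therefore $\Theta/\Psi$ and $\Theta'$ are
exchangeable random objects. Let

$$\pi = \frac{1}{|\Theta'|} \sum_{\theta \in \Theta'} \delta_\theta$$

be the uniform distribution on the finite subset $\Theta'$ of $\Theta$.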

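Applying theorem 2.5 to $\pi$ and $\rho = \delta_{\theta'}$, we get that for any positive partially
exchangeable random variable $\lambda$, with $P$ probability at least $1 - \epsilon$, for any $\theta' \in \Theta'$,

$$\lambda(\theta') r_2(\theta') - \lambda(\theta') r_1(\theta') \leq \frac{\lambda(\theta')^2}{2N}\Bigl[\frac{1}{k} r_1(\theta') + r_2(\theta')\Bigr] + \log|\Theta'| + \log(\epsilon^{-1}).$$

Let us choose

$$\lambda(\theta) = \left(\frac{2N \log\bigl(\frac{|\Theta'|}{\epsilon}\bigr)}{\frac{1}{k} r_1(\theta) + r_2(\theta)}\right)^{1/2},$$

with the convention that when $\frac{1}{k} r_1(\theta) + r_2(\theta) = 0$, then $\lambda r_2(\theta) = \lambda r_1(\theta) = 0$. This
is legitimate, since $|\Theta'|$ and $\frac{1}{k} r_1(\theta') + r_2(\theta')$ are exchangeable random variables, and
since when $\frac{1}{k} r_1(\theta) + r_2(\theta) = 0$, then $r_1(\theta) = r_2(\theta) = 0$.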



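Thus, with $P$ probability at least $1 - \epsilon$, for any $\theta' \in \Theta'$,

$$r_2(\theta') - r_1(\theta') \leq \left(\frac{2 \log\bigl(\frac{|\Theta'|}{\epsilon}\bigr)\bigl[\frac{1}{k} r_1(\theta') + r_2(\theta')\bigr]}{N}\right)^{1/2}.$$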



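Now we can remark that for each $\theta \in \Theta$, $\theta' = c[\Psi(\theta)]$ is such that $f_{\theta'}(X_i) = f_\theta(X_i)$,
for $i = 1, \dots, (k+1)N$. Therefore $r_1(\theta) = r_1(\theta')$ and $r_2(\theta) = r_2(\theta')$.
Thus with $P$ probability at least $1 - \epsilon$, for any $\theta \in \Theta$,

$$r_2(\theta) - r_1(\theta) \leq \left(\frac{2 \log\bigl(\frac{|\Theta'|}{\epsilon}\bigr)}{N}\Bigl[\frac{1}{k} r_1(\theta) + r_2(\theta)\Bigr]\right)^{1/2}. \qquad (2.1)$$

Putting for short $d = \log\bigl(\frac{|\Theta'|}{\epsilon}\bigr)$ and solving inequality (2.1) with respect to $r_2(\theta)$
proves theorem 2.1.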

Note that we have in fact proved a more general version of theorem 2.1 where
$d$ can be taken to be $d = -\log[\bar{\pi}(\theta)\epsilon]$, where

$$\bar{\pi}(\theta) = \sup\{\pi(\theta') : \theta' \in \Theta, \Psi(\theta') = \Psi(\theta)\},$$

for any choice of partially exchangeable prior probability distribution $\pi$.

3. Improvement of the variance term 

We will first improve the variance term in lemma 2.2 when $k = 1$, and $P$ is fully
exchangeable. We will deal afterwards with the general case.

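Theorem 3.1. For any exchangeable probability distribution $P$, with $P$ probability
$1 - \epsilon$, for any $\theta \in \Theta$,

$$r_2(\theta) \leq r_1(\theta) + \frac{d}{N}[1 - 2r_1(\theta)] + \sqrt{\frac{4d}{N}[1 - r_1(\theta)] r_1(\theta) + \frac{d^2}{N^2}[1 - 2r_1(\theta)]^2},$$

where $d = \inf\{-\log[\pi(\theta')\epsilon] : \theta' \in \Theta, \Psi(\theta') = \Psi(\theta)\}$.

Let us pursue our numerical example: assuming that $|\mathcal{Y}| = 2$, $N = 1000$, $h = 10$,
$\epsilon = 0.01$ and $r_1(\theta) = 0.2$, we get that $r_2(\theta) \leq 0.453$.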
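As a quick check of this figure, the following sketch evaluates the bound of theorem 3.1 in the form $r_2 \leq r_1 + \frac{d}{N}[1 - 2r_1] + \sqrt{\frac{4d}{N}[1 - r_1]r_1 + \frac{d^2}{N^2}[1 - 2r_1]^2}$, which is what one gets by solving $r_2 - r_1 \leq \sqrt{2d[r_1 + r_2 - 2r_1 r_2]/N}$ in $r_2$. It assumes that $d$ is replaced by the VC upper bound $h\log(2eN/h) + \log(\epsilon^{-1})$ on a design of size $2N$; the function name is illustrative.

```python
import math


def exchangeable_bound(r1, N, h, eps):
    """Bound of theorem 3.1 (k = 1, fully exchangeable case, assumed form)."""
    d = h * math.log(2 * math.e * N / h) + math.log(1 / eps)
    a = d / N
    return (r1 + a * (1 - 2 * r1)
            + math.sqrt(4 * a * (1 - r1) * r1 + a * a * (1 - 2 * r1) ** 2))


if __name__ == "__main__":
    print(exchangeable_bound(0.2, 1000, 10, 0.01))
```

For the values above the result should fall between 0.452 and 0.453, to be compared with 0.5098 for theorem 2.1 with $k = 1$.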

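Proof. Proving theorem 3.1 will require some lemmas.
Let

$$T(h)(\omega) = \frac{1}{(2N)!} \sum_{\sigma \in \mathfrak{S}_{2N}} h(\omega \circ \sigma), \quad \omega \in \Omega,$$

where $\mathfrak{S}_{2N}$ is the set of permutations of $\{1, \dots, 2N\}$ and where $(\omega \circ \sigma)_i = \omega_{\sigma(i)}$.
For any $\omega \in \Omega$, any $\sigma \in \mathfrak{S}_N$, let $\omega_{2,\sigma}$ be defined as

$$(\omega_{2,\sigma})_i = \begin{cases} \omega_i, & 1 \leq i \leq N, \\ \omega_{N + \sigma(i - N)}, & N < i \leq 2N. \end{cases}$$

Let

$$T'(h)(\omega) = \frac{1}{N!} \sum_{\sigma \in \mathfrak{S}_N} h(\omega_{2,\sigma}).$$

Let us remark that $T = T \circ T'$, and that $T'[r_k(\theta)] = r_k(\theta)$, $k = 1, 2$.

Moreover, we know from the previous section that $T[\exp(U)](\omega) \leq \epsilon$, where

$$U = \lambda[r_2(\theta) - r_1(\theta)] - \frac{\lambda^2}{2N^2} \sum_{i=1}^N (\chi_{i+N} - \chi_i)^2 + \log(\epsilon).$$

Thus $T\{\exp[T'(U)]\} \leq T \circ T'[\exp(U)] = T[\exp(U)] \leq \epsilon$, from the convexity of
the exponential function and the fact that $T'$ is a (regular) conditional probability
measure.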

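But

$$T'(U) = \lambda[r_2(\theta) - r_1(\theta)] - \frac{\lambda^2}{2N} T'(V) + \log(\epsilon),$$

where $V = \frac{1}{N} \sum_{i=1}^N (\chi_{i+N} - \chi_i)^2$. Noticing that

$$T'(V) = \frac{1}{N} \sum_{i=1}^N (\chi_i + \chi_{i+N}) - 2 \Bigl(\frac{1}{N} \sum_{i=1}^N \chi_i\Bigr)\Bigl(\frac{1}{N} \sum_{i=1}^N \chi_{i+N}\Bigr)
= r_1(\theta) + r_2(\theta) - 2 r_1(\theta) r_2(\theta),$$

we get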

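Lemma 3.2. For any exchangeable random variable $\eta$,

$$T\Bigl\{\exp\Bigl[\lambda[r_2(\theta) - r_1(\theta)] - \frac{\lambda^2}{2N}\bigl[r_1(\theta) + r_2(\theta) - 2 r_1(\theta) r_2(\theta)\bigr] - \eta\Bigr]\Bigr\}(\omega)
\leq \exp(-\eta)(\omega), \quad \omega \in \Omega.$$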
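The identity $T'(V) = r_1 + r_2 - 2 r_1 r_2$ behind this lemma is easy to confirm numerically. The sketch below (illustrative names, not from the paper) averages $V = \frac{1}{N}\sum_{i}(\chi_{i+N} - \chi_i)^2$ over all permutations of the second sample and compares the result with the closed form.

```python
import itertools
import random


def averaged_V(chi, N):
    """Average of V over all N! permutations of the second half of chi."""
    first, second = chi[:N], chi[N:]
    total, count = 0.0, 0
    for perm in itertools.permutations(second):
        total += sum((b - a) ** 2 for a, b in zip(first, perm)) / N
        count += 1
    return total / count


def closed_form(chi, N):
    """r1 + r2 - 2 r1 r2 computed from the two empirical error rates."""
    r1 = sum(chi[:N]) / N
    r2 = sum(chi[N:]) / N
    return r1 + r2 - 2 * r1 * r2


if __name__ == "__main__":
    random.seed(0)
    N = 5
    chi = [random.randint(0, 1) for _ in range(2 * N)]
    print(averaged_V(chi, N), closed_form(chi, N))
```

The two printed numbers agree up to rounding, because averaging $\chi_i \chi_{N + \sigma(i)}$ over $\sigma$ gives $\chi_i r_2$.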



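As a consequence,

Lemma 3.3. For any exchangeable probability distribution $P$, any exchangeable
prior distribution $\pi$, with $P$ probability at least $1 - \epsilon$, for any $\theta \in \Theta$,

$$r_2(\theta) \leq r_1(\theta) + \frac{\lambda}{2N}\bigl[r_1(\theta) + r_2(\theta) - 2 r_1(\theta) r_2(\theta)\bigr] + \frac{d}{\lambda},$$

where $d = \inf\{-\log[\pi(\theta')\epsilon] : \theta' \in \Theta, \Psi(\theta') = \Psi(\theta)\}$.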

Remark 3.1. As a special case, we can take $d = \log[\mathcal{N}(X_1^{2N})] - \log(\epsilon)$. This
corresponds to the case when $\pi$ is chosen to be the uniform distribution on $\Theta'$,
using the remark that each $f_\theta$, $\theta \in \Theta$ coincides with some $f_{\theta'}$, $\theta' \in \Theta'$ on the design
$\{X_i : i = 1, \dots, 2N\}$.

We would like to prove a little more, showing that it is legitimate to take in the
previous equation

$$\lambda = \left(\frac{2Nd}{r_1(\theta) + r_2(\theta) - 2 r_1(\theta) r_2(\theta)}\right)^{1/2} = \sqrt{\frac{2Nd}{T'(V)}}.$$

This is not so clear, since this quantity is not (even partially) exchangeable. Anyhow
we can write the following:

$$\sqrt{\frac{2Nd}{T'(V)}}\, |r_2(\theta) - r_1(\theta)| \leq T'\bigl(V^{-1/2}\bigr) \sqrt{2Nd}\, |r_2(\theta) - r_1(\theta)| = T'\Bigl(\sqrt{\frac{2Nd}{V}}\, |r_2(\theta) - r_1(\theta)|\Bigr),$$

because $r \mapsto r^{-1/2}$ is convex. Moreover, using successively the fact that $T'(V)$ is a
symmetric function of $r_1(\theta)$ and $r_2(\theta)$, the fact that cosh is an even function, the
previous inequality, the convexity of cosh, the invariance $T = T \circ T'$, the invariance
of $T$ under $\bigcirc_{i=1}^N T_i$ and the fact that $V$ is almost surely constant under
each $T_i$, we get the following chain of inequalities:
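$$T\Bigl\{\exp\Bigl[\sqrt{\tfrac{2Nd}{T'(V)}}\,[r_2(\theta) - r_1(\theta)] - d + \log(\epsilon)\Bigr]\Bigr\}$$

$$= T\Bigl\{\cosh\Bigl[\sqrt{\tfrac{2Nd}{T'(V)}}\,[r_2(\theta) - r_1(\theta)]\Bigr]\Bigr\} \exp[-d + \log(\epsilon)]$$

$$= T\Bigl\{\cosh\Bigl[\sqrt{\tfrac{2Nd}{T'(V)}}\,|r_2(\theta) - r_1(\theta)|\Bigr]\Bigr\} \exp[-d + \log(\epsilon)]$$

$$\leq T\Bigl\{\cosh\Bigl[T'\Bigl(\sqrt{\tfrac{2Nd}{V}}\,|r_2(\theta) - r_1(\theta)|\Bigr)\Bigr]\Bigr\} \exp[-d + \log(\epsilon)]$$

$$\leq T\Bigl\{T' \cosh\Bigl[\sqrt{\tfrac{2Nd}{V}}\,|r_2(\theta) - r_1(\theta)|\Bigr]\Bigr\} \exp[-d + \log(\epsilon)]$$

$$= T\Bigl\{\cosh\Bigl[\sqrt{\tfrac{2Nd}{V}}\,|r_2(\theta) - r_1(\theta)|\Bigr]\Bigr\} \exp[-d + \log(\epsilon)]$$

$$= T\Bigl\{\exp\Bigl[\sqrt{\tfrac{2Nd}{V}}\,[r_2(\theta) - r_1(\theta)] - d + \log(\epsilon)\Bigr]\Bigr\}$$

$$= T \circ \Bigl(\bigcirc_{i=1}^N T_i\Bigr) \exp\Bigl[\sqrt{\tfrac{2Nd}{V}}\,[r_2(\theta) - r_1(\theta)] - d + \log(\epsilon)\Bigr] \leq \epsilon.$$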



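Thus with $P$ probability at least $1 - \epsilon$, for any $\theta \in \Theta$,

$$r_2(\theta) \leq r_1(\theta) + \sqrt{\frac{2d\, T'(V)}{N}} = r_1(\theta) + \sqrt{\frac{2d\bigl[r_1(\theta) + r_2(\theta) - 2 r_1(\theta) r_2(\theta)\bigr]}{N}}.$$

Solving this inequality in $r_2(\theta)$ ends the proof of theorem 3.1.

□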



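In the general case when $P$ is only partially exchangeable and $k$ is arbitrary, we
will obtain the following

Theorem 3.4. Let $d = \inf\{-\log[\pi(\theta')\epsilon] : \theta' \in \Theta, \Psi(\theta') = \Psi(\theta)\}$ and

$$B(\theta) = \Bigl(1 + \frac{2d}{N}\Bigr)^{-1} \Bigl\{ r_1(\theta) + \frac{d}{N}\bigl\{1 + k^{-1}[1 - 2r_1(\theta)]\bigr\} + (1 + k^{-1}) \sqrt{\frac{2d}{N} r_1(\theta)[1 - r_1(\theta)] + \frac{d^2}{N^2}} \Bigr\}.$$

For any partially exchangeable probability distribution $P$, with $P$ probability at least
$1 - \epsilon$, for any $\theta \in \Theta$ such that $r_1(\theta) < 1/2$ and $B(\theta) \leq 1/2$, $r_2(\theta) \leq B(\theta)$.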

As a special case, the theorem holds with $d = \log[\mathcal{N}(X_1^{(k+1)N})] + \log(\epsilon^{-1})$.
When using a set of binary classification rules $\{f_\theta : \theta \in \Theta\}$ whose VC dimension
is not greater than $h$, we can use the bound $d \leq h \log\bigl(\frac{e(k+1)N}{h}\bigr) - \log(\epsilon)$. The
result is satisfactory when $k$ is large, because in this case $(1 + k^{-1})$ is close to one.
This will be useful in the inductive case.

Let us carry on our numerical example in the binary classification case: taking
$N = 1000$, $h = 10$, $\epsilon = 0.01$ and $r_1(\theta) = 0.2$, we get a bound $B(\theta) \leq 0.4203$ for
values of $k$ ranging from 15 to 18, showing that increasing the size of the shadow
sample has an increased impact when the improved variance term is used.
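The dependence on $k$ can be explored with a short script. It is a sketch under stated assumptions: $B(\theta)$ is taken in the form $\bigl(1 + \frac{2d}{N}\bigr)^{-1}\bigl\{r_1 + \frac{d}{N}\{1 + k^{-1}[1 - 2r_1]\} + (1 + k^{-1})\sqrt{\frac{2d}{N}r_1[1 - r_1] + \frac{d^2}{N^2}}\bigr\}$, obtained by solving the inequality of the proof in $r_2$, and $d$ is replaced by $h\log(e(k+1)N/h) + \log(\epsilon^{-1})$. Names are illustrative.

```python
import math


def theorem_3_4_bound(r1, N, k, h, eps):
    """Transductive bound with improved variance term (assumed form of B)."""
    d = h * math.log(math.e * (k + 1) * N / h) + math.log(1 / eps)
    a = d / N
    inv_k = 1 / k
    root = math.sqrt(2 * a * r1 * (1 - r1) + a * a)
    return (r1 + a * (1 + inv_k * (1 - 2 * r1)) + (1 + inv_k) * root) / (1 + 2 * a)


# Scan the shadow sample multiplier k for the running example.
bounds = {k: theorem_3_4_bound(0.2, 1000, k, 10, 0.01) for k in range(1, 41)}
best_k = min(bounds, key=bounds.get)

if __name__ == "__main__":
    print(best_k, bounds[best_k])
```

The trade-off is visible in the formula: a larger $k$ brings the factor $(1 + k^{-1})$ closer to one but increases the complexity term $d$ logarithmically, so the bound has a flat minimum in $k$.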
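Proof. Let $\Phi(p) = \bigl(p \wedge \tfrac{1}{2}\bigr)\bigl[1 - \bigl(p \wedge \tfrac{1}{2}\bigr)\bigr]$. This is obviously a concave function. We
have proved that

$$\Bigl(\bigcirc_{i=1}^N T_i\Bigr)\bigl\{\exp\bigl[\lambda[r_2(\theta) - r_1(\theta)] - \eta\bigr]\bigr\} \leq \exp\Bigl[\frac{(1 + k^{-1})^2 \lambda^2}{2N^2} \sum_{i=1}^N \Phi(p_i) - \eta\Bigr].$$

As

$$\frac{1}{N} \sum_{i=1}^N \Phi(p_i) \leq \Phi\Bigl(\frac{1}{N} \sum_{i=1}^N p_i\Bigr) = \Phi\Bigl(\frac{r_1(\theta) + k r_2(\theta)}{k+1}\Bigr),$$

this shows that

$$\Bigl(\bigcirc_{i=1}^N T_i\Bigr)\bigl\{\exp\bigl[\lambda[r_2(\theta) - r_1(\theta)] - \eta\bigr]\bigr\} \leq \exp\Bigl[\frac{(1 + k^{-1})^2 \lambda^2}{2N}\, \Phi\Bigl(\frac{r_1(\theta) + k r_2(\theta)}{k+1}\Bigr) - \eta\Bigr].$$

Taking $\eta = \frac{(1 + k^{-1})^2 \lambda^2}{2N}\, \Phi\bigl(\frac{r_1(\theta) + k r_2(\theta)}{k+1}\bigr) - \log(\epsilon)$, and

$$\lambda = \left(\frac{2Nd}{(1 + k^{-1})^2\, \Phi\bigl(\frac{r_1(\theta) + k r_2(\theta)}{k+1}\bigr)}\right)^{1/2},$$

where $d = \inf\{-\log[\pi(\theta')\epsilon] : \theta' \in \Theta, \Psi(\theta') = \Psi(\theta)\}$, we get that with $P$ probability
at least $1 - \epsilon$, for any $\theta \in \Theta$,

$$r_2(\theta) - r_1(\theta) \leq \left(\frac{2(1 + k^{-1})^2 d}{N}\, \Phi\Bigl(\frac{r_1(\theta) + k r_2(\theta)}{k+1}\Bigr)\right)^{1/2}.$$

Solving this inequality in $r_2(\theta)$ ends the proof of theorem 3.4.

□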



4. The inductive setting 

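We will integrate with respect to $P(\,\cdot \mid Z_1^N)$ theorem 2.1 and its variants. Let us
start with theorem 3.4. Let us consider the non identically distributed independent
case, assuming thus that $P = \bigl(\bigotimes_{i=1}^N P_i\bigr)^{\otimes (k+1)}$. Let

$$R(\theta) = \frac{1}{N} \sum_{i=1}^N P_i\bigl[Y_i \neq f_\theta(X_i)\bigr]$$

and

$$\bar{r}(\theta) = \frac{r_1(\theta) + k r_2(\theta)}{k+1}.$$

Let

$$P'(h) = P(h \mid Z_1^N).$$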
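Lemma 4.1. For any partially exchangeable prior distribution $\pi$, any partially
exchangeable positive function $\zeta : \Theta \to \mathbb{R}_+$,

$$P\Bigl\{\sup_{\theta \in \Theta} \int_{\lambda=0}^{+\infty} \zeta \exp\Bigl\{\lambda[R(\theta) - r_1(\theta) - \zeta] - \frac{(1 + k^{-1})^2 \lambda^2}{2N}\, \Phi\Bigl(\frac{r_1(\theta) + k R(\theta)}{k+1}\Bigr) + P'\bigl\{\log[\bar{\pi}(\theta)\epsilon]\bigr\}\Bigr\}\, d\lambda\Bigr\} \leq \epsilon,$$

where $\bar{\pi}(\theta) = \sup\{\pi(\theta') : \theta' \in \Theta, \Psi(\theta') = \Psi(\theta)\}$.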
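Proof. Let

$$U' = \lambda[R(\theta) - r_1(\theta) - \zeta] - \frac{(1 + k^{-1})^2 \lambda^2}{2N}\, \Phi\Bigl(\frac{r_1(\theta) + k R(\theta)}{k+1}\Bigr) + P'\bigl\{\log[\bar{\pi}(\theta)\epsilon]\bigr\}.$$

Let

$$U = \lambda[r_2(\theta) - r_1(\theta) - \zeta] - \frac{(1 + k^{-1})^2 \lambda^2}{2N}\, \Phi[\bar{r}(\theta)] + \log[\bar{\pi}(\theta)\epsilon].$$

The function $\Phi$ being concave,

$$\zeta \exp(U') \leq \zeta \exp[P'(U)] \leq P'[\zeta \exp(U)].$$

Thus

$$\sup_{\theta \in \Theta} \int_{\lambda=0}^{+\infty} \zeta \exp(U')\, d\lambda \leq \sup_{\theta \in \Theta} \int_{\lambda=0}^{+\infty} P'[\zeta \exp(U)]\, d\lambda
= \sup_{\theta \in \Theta} P'\Bigl(\int_{\lambda=0}^{+\infty} \zeta \exp(U)\, d\lambda\Bigr) \leq P'\Bigl(\int_{\lambda=0}^{+\infty} \sup_{\theta \in \Theta}[\zeta \exp(U)]\, d\lambda\Bigr).$$

Moreover

$$\sup_{\theta \in \Theta}[\zeta \exp(U)] \leq \pi[\zeta \exp(S)],$$

where

$$S = U - \log[\bar{\pi}(\theta)] = \lambda[r_2(\theta) - r_1(\theta) - \zeta] - \frac{(1 + k^{-1})^2 \lambda^2}{2N}\, \Phi[\bar{r}(\theta)] + \log(\epsilon).$$

Thus

$$P\Bigl(\sup_{\theta \in \Theta} \int_{\lambda=0}^{+\infty} \zeta \exp(U')\, d\lambda\Bigr) \leq P\Bigl\{P'\Bigl[\int_{\lambda=0}^{+\infty} \pi[\zeta \exp(S)]\, d\lambda\Bigr]\Bigr\}
= P\Bigl\{\int_{\lambda=0}^{+\infty} \pi[\zeta \exp(S)]\, d\lambda\Bigr\}
= P\Bigl\{\pi\Bigl\{\int_{\lambda=0}^{+\infty} \zeta \Bigl(\bigcirc_{i=1}^N T_i\Bigr)[\exp(S)]\, d\lambda\Bigr\}\Bigr\}.$$

But we have established on the occasion of the proof of theorem 3.4 that

$$\Bigl(\bigcirc_{i=1}^N T_i\Bigr)[\exp(S)] \leq \epsilon \exp(-\zeta\lambda).$$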
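This proves that

$$P\Bigl(\sup_{\theta \in \Theta} \int_{\lambda=0}^{+\infty} \zeta \exp(U')\, d\lambda\Bigr) \leq \epsilon,$$

as stated in the lemma.

□

Theorem 4.2. Let

$$B(\theta) = \Bigl(1 + \frac{2d'}{N}\Bigr)^{-1} \Bigl\{ r_1(\theta) + \frac{d'}{N} + \sqrt{\frac{2d' r_1(\theta)[1 - r_1(\theta)]}{N} + \frac{d'^2}{N^2}} \Bigr\},$$

where

$$d' = d (1 + k^{-1})^2 \Bigl(1 - \frac{\log(\alpha)}{2d} + \frac{\alpha}{\sqrt{\pi d}}\Bigr)^2$$

and

$$d = -P\bigl\{\log[\epsilon \bar{\pi}(\theta)] \mid Z_1^N\bigr\}.$$

Let us notice that it covers the case when

$$d = P\bigl\{\log[\mathcal{N}(X_1^{(k+1)N})] \mid Z_1^N\bigr\} + \log(\epsilon^{-1}).$$

In this case, when $|\mathcal{Y}| = 2$ and the set of classification rules has a VC dimension
not greater than $h$,

$$d \leq h \log\Bigl(\frac{e(k+1)N}{h}\Bigr) + \log(\epsilon^{-1}).$$

With $P$ probability at least $1 - \epsilon$, for any $\theta \in \Theta$, $R(\theta) \leq B(\theta)$ when $r_1(\theta) < 1/2$
and $B(\theta) \leq 1/2$.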

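In the case when the model has a VC dimension not greater than $h$, we can
bound as mentioned in the theorem the random variable $d$ with the constant

$$d^* = h \log\Bigl(\frac{e(k+1)N}{h}\Bigr) + \log(\epsilon^{-1}).$$

We can then optimize the choice of $\alpha$ by taking $\alpha = \frac{1}{2}\sqrt{\frac{\pi}{d^*}}$. This leads to

$$d' \leq d^* (1 + k^{-1})^2 \Bigl[1 + \frac{1}{2d^*} \log\Bigl(2e \sqrt{\frac{d^*}{\pi}}\Bigr)\Bigr]^2.$$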

We can also approximately optimize

$$(1 + k^{-1})^2\, h \log\Bigl(\frac{eN(k+1)}{h}\Bigr)$$

by taking $k = 2 \log\bigl(\frac{eN}{h}\bigr)$.

Let us resume our numerical example to illustrate theorem 4.2. Assume that
$N = 1000$, $h = 10$ and $\epsilon = 10^{-2}$. For $r_1(\theta) = 0.2$, we get $B(\theta) \leq 0.4257$ for $k = 19$.
More generally, we get

$$B(\theta) \leq 0.828 \Bigl\{ r_1(\theta) + 0.105 + \sqrt{0.209\, [1 - r_1(\theta)]\, r_1(\theta) + 0.011} \Bigr\}.$$

For comparison, Vapnik's corollary 1.2 in the same situation gives a bound
greater than 0.610, and therefore not significant (since a random classification has
a better expected error rate of 0.5).
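The figures of this example can be reproduced with the sketch below. It assumes the bound of theorem 4.2 in the form $B = (1 + 2d'/N)^{-1}\{r_1 + d'/N + \sqrt{2d' r_1[1 - r_1]/N + d'^2/N^2}\}$, with $d' = d^*(1 + k^{-1})^2\bigl(1 - \frac{\log\alpha}{2d^*} + \frac{\alpha}{\sqrt{\pi d^*}}\bigr)^2$, $d^* = h\log(e(k+1)N/h) + \log(\epsilon^{-1})$ and $\alpha = \frac{1}{2}\sqrt{\pi/d^*}$. Function names are illustrative.

```python
import math


def inductive_d_prime(N, k, h, eps):
    """d' of theorem 4.2 with d replaced by d* and alpha = sqrt(pi / d*) / 2."""
    d = h * math.log(math.e * (k + 1) * N / h) + math.log(1 / eps)
    alpha = 0.5 * math.sqrt(math.pi / d)
    factor = 1 - math.log(alpha) / (2 * d) + alpha / math.sqrt(math.pi * d)
    return d * (1 + 1 / k) ** 2 * factor ** 2


def inductive_bound(r1, N, k, h, eps):
    """Bound B on the expected error rate R (theorem 4.2, assumed form)."""
    a = inductive_d_prime(N, k, h, eps) / N
    return (r1 + a + math.sqrt(2 * a * r1 * (1 - r1) + a * a)) / (1 + 2 * a)


if __name__ == "__main__":
    a = inductive_d_prime(1000, 19, 10, 0.01) / 1000
    # Constants of the "more generally" formula: 1/(1+2a), a, 2a, a^2.
    print(1 / (1 + 2 * a), a, 2 * a, a * a)
    print(inductive_bound(0.2, 1000, 19, 10, 0.01))
```

The four printed constants should round up to 0.828, 0.105, 0.209 and 0.011, and the bound for $r_1 = 0.2$ should fall just below 0.4257.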
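Proof. Let

$$V = (1 + k^{-1})^2\, \Phi\Bigl(\frac{r_1(\theta) + k R(\theta)}{k+1}\Bigr), \qquad d = -P'\bigl\{\log[\bar{\pi}(\theta)\epsilon]\bigr\}, \qquad \Delta = R(\theta) - r_1(\theta) - \zeta.$$

Let us remark that

$$U' = \lambda \Delta - \frac{V}{2N} \lambda^2 - d.$$

Thus

$$\int_{\lambda=0}^{+\infty} \zeta \exp(U')\, d\lambda \geq \mathbb{1}(\Delta \geq 0)\, W,$$

where

$$W = \zeta \sqrt{\frac{\pi N}{2V}} \exp\Bigl(\frac{N \Delta^2}{2V} - d\Bigr).$$

Thus, according to the previous lemma,

$$P\Bigl(\sup_{\theta \in \Theta}\bigl[\mathbb{1}(\Delta \geq 0)\, W\bigr]\Bigr) \leq \epsilon.$$

This proves that with $P$ probability at least $1 - \epsilon$,

$$\sup_{\theta \in \Theta}\bigl[\mathbb{1}(\Delta \geq 0)\, W\bigr] \leq 1.$$

Translated into a logical statement this says that with $P$ probability at least $1 - \epsilon$,
either $\Delta < 0$, or $\log(W) \leq 0$.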
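The lower bound on the integral used in this step is a half-line Gaussian integral: completing the square gives $\int_0^{+\infty} \exp\bigl(\lambda\Delta - \frac{V}{2N}\lambda^2\bigr)\, d\lambda \geq \sqrt{\frac{\pi N}{2V}} \exp\bigl(\frac{N\Delta^2}{2V}\bigr)$ as soon as $\Delta \geq 0$, with equality at $\Delta = 0$. The sketch below checks this numerically with a plain midpoint rule; the truncation point and the step count are arbitrary choices of the sketch, and the names are illustrative.

```python
import math


def half_line_integral(delta, V, N, steps=100_000):
    """Midpoint-rule value of the integral of exp(lam*delta - V*lam^2/(2N)) over lam >= 0."""
    c = V / (2 * N)
    # Truncate far beyond the mode delta/(2c); the integrand is negligible there.
    upper = max(delta, 0.0) / (2 * c) + 12 / math.sqrt(c)
    width = upper / steps
    total = 0.0
    for s in range(steps):
        lam = (s + 0.5) * width
        total += math.exp(lam * delta - c * lam * lam)
    return total * width


def gaussian_lower_bound(delta, V, N):
    """sqrt(pi N / (2V)) * exp(N delta^2 / (2V)), valid as a lower bound when delta >= 0."""
    return math.sqrt(math.pi * N / (2 * V)) * math.exp(N * delta ** 2 / (2 * V))


if __name__ == "__main__":
    for delta in (0.0, 0.05, 0.1):
        print(delta, half_line_integral(delta, 0.25, 1000), gaussian_lower_bound(delta, 0.25, 1000))
```

For $\Delta$ well above zero the integral is close to twice the lower bound, since almost all of the Gaussian mass then lies on the positive half-line.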

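Let $V' = (1 + k^{-1})^2\, \Phi[R(\theta)]$. Consider setting $\zeta = \alpha \sqrt{\frac{2V'}{\pi N}}$, where $\alpha$ is some
positive real number.

We have proved that with $P$ probability at least $1 - \epsilon$,

$$\frac{N \Delta^2}{2V} \leq d - \log(\alpha) + \frac{1}{2} \log\frac{V}{V'}$$

when $\Delta \geq 0$. But $\Phi$ is increasing and when $\Delta \geq 0$, $R(\theta) \geq r_1(\theta)$, thus in this case
$V' \geq V$, and we can weaken and simplify our statement to

$$\frac{N \Delta^2}{2V'} \leq d - \log(\alpha).$$

Equivalently, with $P$ probability at least $1 - \epsilon$, for any $\theta \in \Theta$,

$$R(\theta) - r_1(\theta) \leq \sqrt{\frac{2V' d}{N}} \Bigl(\sqrt{1 - \frac{\log(\alpha)}{d}} + \frac{\alpha}{\sqrt{\pi d}}\Bigr).$$

Using the fact that $\sqrt{1 + x} \leq 1 + \frac{x}{2}$, we get that

$$[R(\theta) - r_1(\theta)]^2 \leq \frac{2d'\, \Phi[R(\theta)]}{N},$$

where $d' = d (1 + k^{-1})^2 \Bigl(1 - \frac{\log(\alpha)}{2d} + \frac{\alpha}{\sqrt{\pi d}}\Bigr)^2$. Since $\Phi(R) = R(1 - R)$ when
$R \leq 1/2$, this can be solved in $R(\theta)$ in this case to end the proof of theorem 4.2.

□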

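With a little more work we could have kept

$$\frac{N \Delta^2}{2V} \leq d - \log(\alpha),$$

leading to

$$R(\theta) - r_1(\theta) \leq \sqrt{\frac{2V[d - \log(\alpha)]}{N}} + \alpha \sqrt{\frac{2V'}{\pi N}}$$

and

$$[R(\theta) - r_1(\theta)]^2 \leq \frac{2V[d - \log(\alpha)]}{N} + 4\alpha \sqrt{\frac{d - \log(\alpha)}{\pi}}\, \frac{V'}{N} + \frac{2\alpha^2 V'}{\pi N}.$$

This leads to the following

Theorem 4.3. Let us put

$$c = 2\alpha \sqrt{\frac{d - \log(\alpha)}{\pi}} + \frac{\alpha^2}{\pi},$$

$$d_1 = d - \log(\alpha) + (1 + k^{-1})^2 c,$$

$$d_2 = (1 + k^{-1}) \Bigl\{ [d - \log(\alpha)] \Bigl[1 - \frac{2 r_1(\theta)}{k+1}\Bigr] + (1 + k^{-1}) c \Bigr\},$$

$$d_3 = (1 + k^{-1})^2 \Bigl\{ d - \log(\alpha) + c + \frac{2c}{k^2 N}[d - \log(\alpha)] \Bigr\},$$

$$d_4 = (1 + k^{-1}) \bigl[d - \log(\alpha) + (1 + k^{-1}) c\bigr].$$

Theorem 4.2 still holds when the bound $B(\theta)$ is strengthened to

$$B(\theta) = \Bigl(1 + \frac{2d_1}{N}\Bigr)^{-1} \Bigl\{ r_1(\theta) + \frac{d_2}{N} + \sqrt{\frac{2 d_3\, r_1(\theta)[1 - r_1(\theta)]}{N} + \frac{d_4^2}{N^2}} \Bigr\}.$$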

On the previous numerical example ($N = 1000$, $h = 10$, $\epsilon = 10^{-2}$, $k = 19$,
$\alpha = \frac{1}{2}\sqrt{\frac{\pi}{d}}$, $r_1(\theta) = 0.2$), we get a bound $B(\theta) \leq 0.4248$, instead of $B(\theta) \leq 0.4257$,
showing that the improvement brought to theorem 4.2 is not so strong, and therefore
that theorem 4.2 is a satisfactory approximation of theorem 4.3.
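A small script makes the comparison concrete. It is a sketch resting on a reading of theorem 4.3 that is assumed here: $c = 2\alpha\sqrt{(d - \log\alpha)/\pi} + \alpha^2/\pi$, the constants $d_1, \dots, d_4$ as written in the code comments, and $B = (1 + 2d_1/N)^{-1}\{r_1 + d_2/N + \sqrt{2d_3 r_1[1 - r_1]/N + d_4^2/N^2}\}$, with $d$ replaced by its VC upper bound and $\alpha = \frac{1}{2}\sqrt{\pi/d}$. Names are illustrative.

```python
import math


def theorem_4_3_bound(r1, N, k, h, eps):
    """Refined inductive bound (assumed reading of theorem 4.3)."""
    d = h * math.log(math.e * (k + 1) * N / h) + math.log(1 / eps)
    alpha = 0.5 * math.sqrt(math.pi / d)
    D = d - math.log(alpha)                                   # d - log(alpha)
    c = 2 * alpha * math.sqrt(D / math.pi) + alpha ** 2 / math.pi
    a = 1 + 1 / k                                             # 1 + k^{-1}
    d1 = D + a * a * c
    d2 = a * (D * (1 - 2 * r1 / (k + 1)) + a * c)
    d3 = a * a * (D + c + 2 * c * D / (k * k * N))
    d4 = a * (D + a * c)
    root = math.sqrt(2 * d3 * r1 * (1 - r1) / N + (d4 / N) ** 2)
    return (r1 + d2 / N + root) / (1 + 2 * d1 / N)


if __name__ == "__main__":
    print(theorem_4_3_bound(0.2, 1000, 19, 10, 0.01))
```

Under these assumptions the value should land just below 0.4248, which is the figure quoted above.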

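Starting from lemma 2.2 we can make the same kind of computations taking
$V = (1 + k^{-1}) \frac{r_1(\theta) + k R(\theta)}{k+1}$, to obtain that with $P$ probability at least $1 - \epsilon$,

$$R(\theta) - r_1(\theta) \leq \sqrt{\frac{2V' d}{N}} \Bigl(1 - \frac{\log(\alpha)}{2d} + \frac{\alpha}{\sqrt{\pi d}}\Bigr),$$

where $V' = (1 + k^{-1}) R(\theta)$. This proves the following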

Theorem 4.4. For any positive constant $\alpha$, with $P$ probability at least $1 - \epsilon$, for
any $\theta \in \Theta$,

$$R(\theta) \leq r_1(\theta) + \frac{d'}{N} + \sqrt{\frac{2d' r_1(\theta)}{N} + \frac{d'^2}{N^2}},$$

where $d' = (1 + k^{-1}) \Bigl(1 - \frac{\log(\alpha)}{2d} + \frac{\alpha}{\sqrt{\pi d}}\Bigr)^2 d$.

Our previous numerical application gives in this case a non significant bound
$R(\theta) \leq 0.516$ (for the best value of $k = 9$), showing that the improvement of the
variance term has a decisive impact when $r_1(\theta)$ is not small.
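The effect of dropping the improved variance term can be seen by evaluating theorem 4.4 directly. The sketch assumes the bound $R \leq r_1 + d'/N + \sqrt{2d' r_1/N + d'^2/N^2}$ with $d' = (1 + k^{-1})\bigl(1 - \frac{\log\alpha}{2d} + \frac{\alpha}{\sqrt{\pi d}}\bigr)^2 d$, the VC upper bound for $d$, and $\alpha = \frac{1}{2}\sqrt{\pi/d}$ (the choice of $\alpha$ is an assumption of the sketch). Names are illustrative.

```python
import math


def theorem_4_4_bound(r1, N, k, h, eps):
    """Inductive bound derived from lemma 2.2 (assumed form, no variance improvement)."""
    d = h * math.log(math.e * (k + 1) * N / h) + math.log(1 / eps)
    alpha = 0.5 * math.sqrt(math.pi / d)
    factor = 1 - math.log(alpha) / (2 * d) + alpha / math.sqrt(math.pi * d)
    a = (1 + 1 / k) * factor ** 2 * d / N
    return r1 + a + math.sqrt(2 * a * r1 + a * a)


# Scan k for the running example N = 1000, h = 10, eps = 0.01, r1 = 0.2.
values = {k: theorem_4_4_bound(0.2, 1000, k, 10, 0.01) for k in range(1, 31)}
best_k = min(values, key=values.get)

if __name__ == "__main__":
    print(best_k, values[best_k])
```

The minimum over $k$ stays above 0.5 for this example, which is why the bound is called non significant.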
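In the fully exchangeable case, when $k = 1$, a slightly better result can be
obtained, using lemma 3.2 and thus putting

$$V = r_1(\theta) + R(\theta) - 2 r_1(\theta) R(\theta),$$

$$V' = 2R(\theta)[1 - R(\theta)].$$

It leads to the following theorem

Theorem 4.5. Let $d' = d - \log(\alpha)$ and

$$c = 2\alpha \sqrt{\frac{d - \log(\alpha)}{\pi}} + \frac{\alpha^2}{\pi}.$$

Theorem 4.2 still holds when the bound is tightened to

$$B(\theta) = \Bigl(1 + \frac{4c}{N}\Bigr)^{-1} \Bigl\{ r_1(\theta) + [1 - 2r_1(\theta)] \frac{d'}{N} + \frac{2c}{N} + \sqrt{\frac{4(d' + c)\, r_1(\theta)[1 - r_1(\theta)]}{N} + \frac{d'^2}{N^2}[1 - 2r_1(\theta)]^2 + \frac{4c(d' + c)}{N^2}} \Bigr\}.$$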

Remark 4.1. Our previous numerical example gives in this case a bound $B(\theta) \leq 0.460$
(for $\alpha = \frac{1}{2}\sqrt{\frac{\pi}{d}}$). This shows that the improvement brought by a better
variance term is significant, but that the optimization of the size of the shadow
sample is also interesting.
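The value quoted in this remark can be approached with the sketch below. It assumes the bound of theorem 4.5 in the form $B = (1 + 4c/N)^{-1}\{r_1 + [1 - 2r_1]d'/N + 2c/N + \sqrt{4(d' + c)r_1[1 - r_1]/N + d'^2[1 - 2r_1]^2/N^2 + 4c(d' + c)/N^2}\}$, with $d' = d - \log\alpha$, $d \leq h\log(2eN/h) + \log(\epsilon^{-1})$ and $\alpha = \frac{1}{2}\sqrt{\pi/d}$; both the form of $B$ and the value of $\alpha$ are assumptions of the sketch. Names are illustrative.

```python
import math


def theorem_4_5_bound(r1, N, h, eps):
    """Inductive bound in the fully exchangeable case k = 1 (assumed form)."""
    d = h * math.log(2 * math.e * N / h) + math.log(1 / eps)
    alpha = 0.5 * math.sqrt(math.pi / d)
    dp = d - math.log(alpha)                                  # d' = d - log(alpha)
    c = 2 * alpha * math.sqrt(dp / math.pi) + alpha ** 2 / math.pi
    root = math.sqrt(4 * (dp + c) * r1 * (1 - r1) / N
                     + (dp / N) ** 2 * (1 - 2 * r1) ** 2
                     + 4 * c * (dp + c) / N ** 2)
    return (r1 + (1 - 2 * r1) * dp / N + 2 * c / N + root) / (1 + 4 * c / N)


if __name__ == "__main__":
    print(theorem_4_5_bound(0.2, 1000, 10, 0.01))
```

With $N = 1000$, $h = 10$, $\epsilon = 0.01$ and $r_1 = 0.2$ the result should sit between 0.459 and 0.460.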



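Remark 4.2. Note that we can take $\alpha = 1$. In this case, $d' = d$ and $c = 2\sqrt{\frac{d}{\pi}} + \frac{1}{\pi}$.
Note that we can also take $\alpha = d^{-1/2}$, leading to $d' = d + \frac{1}{2}\log(d)$ and

$$c = \frac{2}{\sqrt{\pi}} \sqrt{1 + \frac{\log(d)}{2d}} + \frac{1}{\pi d} \leq \frac{2}{\sqrt{\pi}} \Bigl(1 + \frac{\log(d)}{4d}\Bigr) + \frac{1}{\pi d} \leq \frac{3}{2}.$$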

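Remark 4.3. Note also that the bound can be weakened and simplified to

$$B(\theta) \leq r_1(\theta) + [1 - 2r_1(\theta)] \frac{d''}{N} + \sqrt{\frac{4d''\, r_1(\theta)[1 - r_1(\theta)]}{N} + [1 - 2r_1(\theta)]^2 \frac{d''^2}{N^2}},$$

where $d'' = d' + 2c$. Taking $\alpha = d^{-1/2}$ gives $d'' \leq d + \frac{1}{2}\log(d) + 3$.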

Another technical possibility to get inductive bounds is to choose some near
optimal value for $\lambda$, instead of averaging over some exponential prior distribution
on $\lambda$.

This leads to the following theorem

Theorem 4.6. Let

$$\bar{d} = P\bigl\{\log[\bar{\pi}(\theta)^{-1} \epsilon^{-1}]\bigr\},$$

$$d = P\bigl\{\log[\bar{\pi}(\theta)^{-1} \epsilon^{-1}] \mid Z_1^N\bigr\},$$

$$d' = \frac{1}{4} (1 + k^{-1})^2 (d + \bar{d}) \Bigl(1 + \frac{d}{\bar{d}}\Bigr).$$

Theorem 4.2 still holds when the bound is tightened to

$$B(\theta) = \Bigl(1 + \frac{2d'}{N}\Bigr)^{-1} \Bigl\{ r_1(\theta) + \frac{d'}{N} + \sqrt{\frac{2d' r_1(\theta)[1 - r_1(\theta)]}{N} + \frac{d'^2}{N^2}} \Bigr\}.$$

Moreover, putting $d^* = \operatorname{ess\,sup}_P \log[\bar{\pi}(\theta)^{-1} \epsilon^{-1}]$, $d'$ can be replaced with $d^*(1 + k^{-1})^2$
in the previous bound. In the case of a VC class of dimension $h$, $d^*$ can be bounded
by $h \log\bigl(\frac{e(k+1)N}{h}\bigr) + \log(\epsilon^{-1})$.

Following our numerical example ($N = 1000$, $h = 10$, $r_1(\theta) = 0.2$), we get an
optimal value of $B(\theta) \leq 0.4213$ for $k$ ranging from 17 to 19. This shows that in
this case going from the transductive setting to the inductive one was done with
an insignificant loss of 0.001. Although making use of a rather cumbersome flavor
of entropy term in the general case, theorem 4.6 provides the tightest bound in the
case of a VC class.
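The VC case of this theorem is simple enough to tabulate. The sketch assumes that $d'$ is replaced by $d^*(1 + k^{-1})^2$ with $d^* = h\log(e(k+1)N/h) + \log(\epsilon^{-1})$, that $\epsilon = 0.01$ as in the earlier examples, and that the bound has the form $B = (1 + 2d'/N)^{-1}\{r_1 + d'/N + \sqrt{2d' r_1[1 - r_1]/N + d'^2/N^2}\}$. Names are illustrative.

```python
import math


def theorem_4_6_bound(r1, N, k, h, eps):
    """Inductive bound for a VC class with d' = d* (1 + 1/k)^2 (assumed form)."""
    d_star = h * math.log(math.e * (k + 1) * N / h) + math.log(1 / eps)
    a = d_star * (1 + 1 / k) ** 2 / N
    return (r1 + a + math.sqrt(2 * a * r1 * (1 - r1) + a * a)) / (1 + 2 * a)


# Scan k for N = 1000, h = 10, eps = 0.01, r1 = 0.2.
table = {k: theorem_4_6_bound(0.2, 1000, k, 10, 0.01) for k in range(1, 41)}
best_k = min(table, key=table.get)

if __name__ == "__main__":
    print(best_k, table[best_k])
```

The minimum is very flat around its optimum, which is consistent with the range of $k$ reported above.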



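Proof. Starting from

$$P\Bigl\{\sup_{\theta \in \Theta} \exp\Bigl[\lambda(\theta)[R(\theta) - r_1(\theta)] - \frac{\lambda(\theta)^2}{2N} (1 + k^{-1})^2\, \Phi\Bigl(\frac{r_1(\theta) + k R(\theta)}{k+1}\Bigr) + P'\bigl\{\log[\bar{\pi}(\theta)\epsilon]\bigr\}\Bigr]\Bigr\} \leq \epsilon,$$

we can choose

$$\lambda = \left(\frac{-2N\, P\bigl\{\log[\bar{\pi}(\theta)\epsilon]\bigr\}}{(1 + k^{-1})^2\, \Phi[R(\theta)]}\right)^{1/2}.$$

We get with $P$ probability at least $1 - \epsilon$,

$$\lambda[R(\theta) - r_1(\theta)] \leq \bar{d}\, \frac{\Phi\bigl(\frac{r_1(\theta) + k R(\theta)}{k+1}\bigr)}{\Phi[R(\theta)]} + d.$$

We can then remark that whenever $R(\theta) \geq r_1(\theta)$, then

$$\frac{\Phi\bigl(\frac{r_1(\theta) + k R(\theta)}{k+1}\bigr)}{\Phi[R(\theta)]} \leq 1,$$

to get

$$[R(\theta) - r_1(\theta)]^2 \leq \frac{2d'\, \Phi[R(\theta)]}{N}.$$

Solving this inequality in $R(\theta)$ ends the proof of theorem 4.6.

□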



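In the same way, in the fully exchangeable case, starting from

$$P\Bigl\{\sup_{\theta \in \Theta} \exp\Bigl[\lambda(\theta)[R(\theta) - r_1(\theta)] - \frac{\lambda(\theta)^2}{2N}\bigl[R(\theta) + r_1(\theta) - 2 r_1(\theta) R(\theta)\bigr] - d\Bigr]\Bigr\} \leq \epsilon,$$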



we can take 



to get 
Theorem 4.7. Let $d' = \frac{1}{2}\bar{d}\bigl(1 + \frac{d}{\bar{d}}\bigr)^2$, and assume that $P$ is fully exchangeable.
Theorem 4.2 still holds when the bound is tightened to



2d 'Y) ( n\ , d ' , , d' 2 



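Moreover, putting $d^* = \operatorname{ess\,sup}_P \log[\bar{\pi}(\theta)^{-1} \epsilon^{-1}]$, $d'$ can be replaced with $2d^*$ in the
previous bound. In the case of a VC class of dimension $h$, $d^*$ can be bounded by
$h \log\bigl(\frac{2eN}{h}\bigr) + \log(\epsilon^{-1})$.

Our numerical example ($N = 1000$, $h = 10$, $\epsilon = 0.01$ and $r_1(\theta) = 0.2$) gives a
bound $B(\theta) \leq 0.445$.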



5. Using relative bounds 

Relative bounds were introduced in the PhD thesis of our student Jean-Yves
Audibert [2]. Here we will use them to sharpen Vapnik's bounds when $r_1(\theta)$ and $N$
are large (a flavor of how large they should be is given in the numerical application
at the end of this section). Audibert showed that chaining relative bounds can be
used to remove $\log(N)$ terms in Vapnik bounds. Here, we will generalize relative
bounds to increased shadow samples and will use only one step of the chaining
method (lest we would spoil the constants too much, the price to pay being a
trailing $\log[\log(N)]$ term which anyhow behaves like a constant in practice).

Let us assume that $P$ is partially exchangeable. Let $\theta, \theta' \in \Theta$, and let

$$\chi_i = \mathbb{1}[Y_i \neq f_\theta(X_i)] - \mathbb{1}[Y_i \neq f_{\theta'}(X_i)],$$

$$r'_1(\theta, \theta') = r_1(\theta) - r_1(\theta') = \frac{1}{N} \sum_{i=1}^N \chi_i,$$

$$r'_2(\theta, \theta') = r_2(\theta) - r_2(\theta') = \frac{1}{kN} \sum_{i=N+1}^{(k+1)N} \chi_i.$$

For any real number $x$, let $g(x) = x^{-2}[\exp(x) - 1 - x]$. As is well known,
$x \mapsto g(x) : \mathbb{R} \to \mathbb{R}$ is an increasing function. This is the key argument in the proof
of Bernstein's deviation inequality.
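The function $g$ has a removable singularity at $0$, where it extends continuously by $g(0) = 1/2$. The sketch below is one way to evaluate it stably and to check its monotonicity on a grid; the cut-off `1e-4` for switching to the Taylor expansion is an arbitrary choice of the sketch.

```python
import math


def g(x):
    """g(x) = (exp(x) - 1 - x) / x^2, extended by continuity with g(0) = 1/2."""
    if abs(x) < 1e-4:
        # First terms of the Taylor expansion 1/2 + x/6 + x^2/24 + ...
        return 0.5 + x / 6 + x * x / 24
    # expm1 avoids the cancellation in exp(x) - 1 for small |x|.
    return (math.expm1(x) - x) / (x * x)


if __name__ == "__main__":
    grid = [i / 100 for i in range(-1000, 1001)]
    increasing = all(g(a) < g(b) for a, b in zip(grid, grid[1:]))
    print(g(0.0), increasing)
```

Monotonicity is what allows $g[-\alpha(\chi_i - p_i)]$ to be replaced by $g(2\alpha)$ in the proof that follows, since $|\chi_i - p_i| \leq 2$.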

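Let

$$\ell(\theta, \theta') = \frac{1}{(k+1)N} \sum_{i=1}^{(k+1)N} \mathbb{1}\bigl[f_\theta(X_i) \neq f_{\theta'}(X_i)\bigr].$$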

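Lemma 5.1. For any partially exchangeable random variable $\lambda : \Omega \to \mathbb{R}_+$,

$$\Bigl(\bigcirc_{i=1}^N T_i\Bigr) \exp\Bigl\{\lambda[r'_2(\theta, \theta') - r'_1(\theta, \theta')] - g\Bigl(\frac{2(1 + k^{-1})\lambda}{N}\Bigr) (1 + k^{-1})^2 \frac{\lambda^2}{N}\, \ell(\theta, \theta')\Bigr\} \leq 1.$$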

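Proof. For any partially exchangeable random variable $\eta$,

$$\log\Bigl\{\Bigl(\bigcirc_{i=1}^N T_i\Bigr) \exp\bigl[\lambda[r'_2(\theta, \theta') - r'_1(\theta, \theta')] - \eta\bigr]\Bigr\}
= -\eta + \sum_{i=1}^N \log\Bigl\{T_i \exp\Bigl(\frac{\lambda}{kN} \sum_{j=1}^k \chi_{i+jN} - \frac{\lambda}{N} \chi_i\Bigr)\Bigr\}$$

$$= -\eta + \sum_{i=1}^N \log\Bigl\{\exp\Bigl(\frac{\lambda}{kN} \sum_{j=0}^k \chi_{i+jN}\Bigr)\, T_i \exp\Bigl(-(1 + k^{-1}) \frac{\lambda}{N} \chi_i\Bigr)\Bigr\}$$

$$= -\eta + \frac{\lambda}{kN} \sum_{i=1}^{(k+1)N} \chi_i + \sum_{i=1}^N \log\Bigl\{T_i\Bigl[\exp\Bigl(-(1 + k^{-1}) \frac{\lambda}{N} \chi_i\Bigr)\Bigr]\Bigr\}.$$

Now we can apply Bernstein's inequality to $\log\bigl\{T_i \exp\bigl[-(1 + k^{-1}) \frac{\lambda}{N} \chi_i\bigr]\bigr\}$
to show that

$$\log\Bigl\{T_i \exp\Bigl[-(1 + k^{-1}) \frac{\lambda}{N} \chi_i\Bigr]\Bigr\} \leq -\frac{\lambda}{kN} \sum_{j=0}^k \chi_{i+jN} + (1 + k^{-1})^2 \frac{\lambda^2}{N^2}\, g\Bigl(\frac{2\lambda}{N}(1 + k^{-1})\Bigr)\, T_i(\chi_i^2),$$

where we have put $p_i = T_i(\chi_i) = \frac{1}{k+1} \sum_{j=0}^k \chi_{i+jN}$. Anyhow, let us reproduce
the proof of this statement here, for the sake of completeness. Let us put $\alpha = (1 + k^{-1}) \frac{\lambda}{N}$.

$$\log\bigl\{T_i[\exp(-\alpha \chi_i)]\bigr\} = -\alpha p_i + \log\Bigl\{1 + T_i\bigl[\exp[-\alpha(\chi_i - p_i)] - 1 + \alpha(\chi_i - p_i)\bigr]\Bigr\}$$

$$\leq -\alpha p_i + T_i\bigl\{\alpha^2 (\chi_i - p_i)^2\, g[-\alpha(\chi_i - p_i)]\bigr\}
\leq -\alpha p_i + g(2\alpha)\, \alpha^2\, T_i\bigl[(\chi_i - p_i)^2\bigr].$$

We can now use the bound $T_i\bigl[(\chi_i - p_i)^2\bigr] \leq T_i(\chi_i^2)$ and remark that
$\chi_i^2 \leq \mathbb{1}[f_\theta(X_i) \neq f_{\theta'}(X_i)]$ to get

$$\log\Bigl\{\Bigl(\bigcirc_{i=1}^N T_i\Bigr) \exp\bigl[\lambda[r'_2(\theta, \theta') - r'_1(\theta, \theta')] - \eta\bigr]\Bigr\}
\leq g(2\alpha)\, \alpha^2 \sum_{i=1}^N T_i\bigl\{\mathbb{1}[f_\theta(X_i) \neq f_{\theta'}(X_i)]\bigr\} - \eta$$

$$= (1 + k^{-1})^2\, g\Bigl(\frac{2\lambda(1 + k^{-1})}{N}\Bigr) \frac{\lambda^2}{N}\, \ell(\theta, \theta') - \eta.$$

We end the proof by choosing

$$\eta = g\Bigl(\frac{2(1 + k^{-1})\lambda}{N}\Bigr) (1 + k^{-1})^2 \frac{\lambda^2}{N}\, \ell(\theta, \theta').$$

□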



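We deduce easily from the previous lemma the following

Proposition 5.2. For any partially exchangeable prior distributions $\pi$,
$\pi' : \Omega \to \mathcal{M}_+^1(\Theta)$, for any partially exchangeable probability measure $P \in \mathcal{M}_+^1(\Omega)$,
with $P$ probability at least $1 - \epsilon$, for any $\theta, \theta' \in \Theta$,

$$r'_2(\theta, \theta') - r'_1(\theta, \theta') \leq g\Bigl(\frac{2(1 + k^{-1})\lambda}{N}\Bigr) (1 + k^{-1})^2 \frac{\lambda}{N}\, \ell(\theta, \theta') + \frac{1}{\lambda} \log\bigl[\bar{\pi}(\theta)^{-1}\, \bar{\pi}'(\theta')^{-1}\, \epsilon^{-1}\bigr],$$

where $\bar{\pi}(\theta) = \sup\{\pi(\theta'') : \theta'' \in \Theta, \Psi(\theta'') = \Psi(\theta)\}$ and an analogous definition is
used for $\bar{\pi}'$.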



Let us now assume that we use a set of binary classification rules $\{f_\theta : \theta \in \Theta\}$
with VC dimension not greater than $h$.

Let us consider in the following the values

$$\xi_j = \frac{\lfloor (k+1)N \exp(-j) \rfloor}{(k+1)N} \leq \exp(-j),$$

where $\lfloor x \rfloor$ is the lower integer part of the real number $x$. Let us define

$$d_j = h \log\Bigl[\frac{2e^2(k+1)N}{h\, \xi_j}\Bigr] + \log\bigl[e \log(N)(h+1)\epsilon^{-1}\bigr].$$

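Proposition 5.3. With $P$ probability at least $1 - \epsilon$, for any $\theta \in \Theta$, any
$j \in \{1, \dots, \lfloor \log(N) \rfloor\}$, there is $\theta'_j \in \Theta_{\xi_j}$ such that

$$r_2(\theta) - r_1(\theta) \leq r_2(\theta'_j) - r_1(\theta'_j) + \Bigl[g\Bigl(\sqrt{\frac{8 d_j}{N \xi_j}}\Bigr) + \frac{1}{2}\Bigr] \sqrt{\frac{2(1 + k^{-1})^2\, \xi_j\, d_j}{N}}.$$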



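Proof. Let us recall a lemma due to David Haussler [8]: when the VC dimension
of $\{f_\theta : \theta \in \Theta\}$ is not greater than $h$, then, for any $\xi = \frac{m}{(k+1)N}$ with $m$ a positive integer, we can find some
$\xi$-covering net $\Theta_\xi \subset \Theta$ for the distance $\ell$ (which is a random exchangeable object),
such that

$$|\Theta_\xi| \leq e(h+1) \Bigl(\frac{2e}{\xi}\Bigr)^h.$$

Let us put on

$$\bigcup_{j,\, 1 \leq j \leq \log(N)} \Theta_{\xi_j}$$

the prior probability distribution defined by

$$\pi'(\theta') = \bigl(\lfloor \log(N) \rfloor\, |\Theta_{\xi_j}|\bigr)^{-1}, \quad \theta' \in \Theta_{\xi_j}.$$

We see that with $P$ probability at least $1 - \epsilon$, for any $\theta \in \Theta$, any $j$, $1 \leq j \leq \log(N)$,
there is $\theta'_j \in \Theta_{\xi_j}$ such that $\ell(\theta, \theta'_j) \leq \xi_j$, and therefore such that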
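$$r'_2(\theta, \theta'_j) - r'_1(\theta, \theta'_j) \leq g\Bigl(\frac{2(1 + k^{-1})\lambda}{N}\Bigr) (1 + k^{-1})^2 \frac{\lambda}{N}\, \xi_j + \frac{1}{\lambda} \log\bigl[\bar{\pi}(\theta)^{-1}\, \bar{\pi}'(\theta'_j)^{-1}\, \epsilon^{-1}\bigr],$$

where we can take $\bar{\pi}(\theta) \geq \bigl(\frac{h}{e(k+1)N}\bigr)^h$ and where $\pi'(\theta'_j)$ has been defined earlier.
We can then choose

$$\lambda(\theta, \theta'_j) = \left(\frac{2N \log\bigl\{[\bar{\pi}(\theta)\, \bar{\pi}'(\theta'_j)\, \epsilon]^{-1}\bigr\}}{(1 + k^{-1})^2\, \xi_j}\right)^{1/2}$$

to prove proposition 5.3.

□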



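On the other hand, from theorem 3.4 applied to $\bigcup_j \Theta_{\xi_j}$ and $\pi'$, we see that with
$P$ probability at least $1 - \epsilon$, for any $j$, $1 \leq j \leq \log(N)$ and any $\theta'_j \in \Theta_{\xi_j}$, putting

$$d'_j = h \log\Bigl(\frac{2e}{\xi_j}\Bigr) + \log\bigl[e \log(N)(h+1)\bigr] - \log(\epsilon),$$

we have

$$r_2(\theta'_j) - r_1(\theta'_j) \leq \left(\frac{2(1 + k^{-1})^2 d'_j}{N}\, \Phi\Bigl(\frac{r_1(\theta'_j) + k r_2(\theta'_j)}{1 + k}\Bigr)\right)^{1/2}.$$

We can then remark that when $\ell(\theta, \theta'_j) \leq \xi_j$,

$$\frac{1}{1+k}\bigl[r_1(\theta'_j) + k r_2(\theta'_j)\bigr] \leq \frac{1}{1+k}\bigl[r_1(\theta) + k r_2(\theta)\bigr] + \ell(\theta, \theta'_j) \leq \frac{1}{1+k}\bigl[r_1(\theta) + k r_2(\theta)\bigr] + \xi_j.$$

We have proved the following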

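Theorem 5.4. With $P$ probability at least $1 - 2\epsilon$,

$$r_2(\theta) - r_1(\theta) \leq \inf_{j \in \mathbb{N}^*,\, 1 \leq j \leq \log(N)} \Bigl\{ \Bigl[g\Bigl(\sqrt{\frac{8 d_j}{N \xi_j}}\Bigr) + \frac{1}{2}\Bigr] \sqrt{\frac{2(1 + k^{-1})^2\, \xi_j\, d_j}{N}} + \sqrt{\frac{2(1 + k^{-1})^2 d'_j}{N}\, \Phi\Bigl(\frac{r_1(\theta) + k r_2(\theta)}{1 + k} + \xi_j\Bigr)} \Bigr\}.$$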

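Remark 5.1. To use this theorem, we have to solve equations of the type

$$r_2 - r_1 \leq a + b\, \Phi\Bigl(\frac{r_1 + k r_2}{1 + k} + \xi\Bigr)^{1/2}.$$

Whenever $r_1$ and the bound are less than $1/2$, this is equivalent to

$$r_2 \leq \frac{B + \sqrt{B^2 - AC}}{A},$$

where

$$A = 1 + \frac{k^2 b^2}{(1 + k)^2},$$

$$B = r_1 + a + \frac{k b^2}{2(1 + k)^2}\bigl[(1 + k)(1 - 2\xi) - 2 r_1\bigr],$$

$$C = (r_1 + a)^2 - \frac{b^2}{(1 + k)^2}\bigl[(1 + k)\xi + r_1\bigr]\bigl[(1 + k)(1 - \xi) - r_1\bigr].$$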
Let us make some numerical application. We should take $N$ pretty large,
because the expected benefit of this last theorem is to improve on the $\log(N)$ term
(the optimization in $\xi_j$ allows us to kill the $\log(N)$ term in $d_j$ and be left only with
$\log[\log(N)]$ terms). So let us take $N = 10^6$, $h = 10$, $r_1(\theta) = 0.2$ and $\epsilon = 0.005$. For
these values, theorem 3.4 gives a bound greater than 0.2075 and less than 0.2076
when $k$ ranges from 24 to 46. Here we obtain a bound less than 0.2070 for $k$ ranging
from 24 to 46, the optimal values for $(k, j)$ being $(257, 7)$, giving a bound less than
0.20672. The bound is less than 0.2068 for $k$ ranging from 42 to 19470, showing
that we can really use big shadow samples with theorem 5.4!
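The computation behind these figures can be sketched as follows. The code relies on a reading of theorem 5.4 that is assumed here: for each $j$, $r_2 - r_1 \leq a_j + b_j \sqrt{\Phi\bigl(\frac{r_1 + k r_2}{1+k} + \xi_j\bigr)}$ with $a_j = \bigl[g\bigl(\sqrt{8d_j/(N\xi_j)}\bigr) + \frac{1}{2}\bigr]\sqrt{2(1 + k^{-1})^2 \xi_j d_j / N}$ and $b_j = \sqrt{2(1 + k^{-1})^2 d'_j / N}$, solved in $r_2$ with the quadratic formula of remark 5.1. Function names are illustrative.

```python
import math


def g(x):
    """g(x) = (exp(x) - 1 - x) / x^2 for x != 0."""
    return (math.expm1(x) - x) / (x * x)


def relative_bound(r1, N, k, j, h, eps):
    """Bound on r2 from theorem 5.4 for a fixed j (assumed form), via remark 5.1."""
    M = (k + 1) * N
    xi = math.floor(M * math.exp(-j)) / M
    inv = 1 + 1 / k
    log_term = math.log(math.e * math.log(N) * (h + 1) / eps)
    d_j = h * math.log(2 * math.e ** 2 * M / (h * xi)) + log_term
    d_j_prime = h * math.log(2 * math.e / xi) + log_term
    a = (g(math.sqrt(8 * d_j / (N * xi))) + 0.5) * math.sqrt(2 * inv ** 2 * xi * d_j / N)
    b = math.sqrt(2 * inv ** 2 * d_j_prime / N)
    # Coefficients of remark 5.1: A r2^2 - 2 B r2 + C <= 0.
    A = 1 + (k * b / (1 + k)) ** 2
    B = r1 + a + k * b * b / (2 * (1 + k) ** 2) * ((1 + k) * (1 - 2 * xi) - 2 * r1)
    C = (r1 + a) ** 2 - b * b / (1 + k) ** 2 * ((1 + k) * xi + r1) * ((1 + k) * (1 - xi) - r1)
    return (B + math.sqrt(B * B - A * C)) / A


if __name__ == "__main__":
    print(relative_bound(0.2, 10 ** 6, 257, 7, 10, 0.005))
```

For $(k, j) = (257, 7)$ the value obtained this way should be close to 0.2067, in line with the optimum reported above; scanning $k$ and $j$ is a matter of wrapping the call in two loops.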